options(repos = c(CRAN = "https://cloud.r-project.org"))
#| warning: false
suppressPackageStartupMessages({
library(tidyverse)
library(here)
library(knitr)
library(kableExtra)
library(ggplot2)
library(plotly)
library(gridExtra)
})Jin_hp2
Introduction
This report is intended to observe and analyze the circumstances of trees in NYC through exploring the tree censuses conducted by New York City Department of Parks & Recreation in 1995, 2005, and 2015. Trees as one of the essential components of urban landscape provide a city with extensive vibrancy and vitality. Therefore, monitoring the condition of trees should have been a significant task.
About Dataset
There are 3 separate datasets regarding the tree conditions given, which were collected at 3 times with a 1-decade interval in between them as aforementioned. There is also another dataset about tree species which involve detailed information about each specific species of tree that grow in NYC. The number of trees across the 3 time phases demonstrates a slight increase but with the majority overlapped. Thus, since the 2015 census is the most recent one, I would mainly concentrate on that for analysis.
Setup
Since the dataset csv files except the tree species one are all over 100MB, it would be efficient to rather read them into independent rds files once for all.
here::i_am("Jin_hp2.qmd")here() starts at /Users/apple/Desktop/ling343-files/ling343-hp2
#| warning: false
#df_tree_species <- read.csv(here("new_york_tree_species.csv"))
#df_trees_1995 <- read.csv(here("new_york_tree_census_1995.csv"))
#df_trees_2005 <- read.csv(here("new_york_tree_census_2005.csv"))
#df_trees_2015 <- read.csv(here("new_york_tree_census_2015.csv"))
#write_rds(df_tree_species, "nyc_tree_species.rds")
#write_rds(df_trees_1995, "nyc_tree_census_1995.rds")
#write_rds(df_trees_2005, "nyc_tree_census_2005.rds")
#write_rds(df_trees_2015, "nyc_tree_census_2015.rds")df_species <- read_rds("nyc_tree_species.rds")
df_1995 <- read_rds("nyc_tree_census_1995.rds")
df_2005 <- read_rds("nyc_tree_census_2005.rds")
df_2015 <- read_rds("nyc_tree_census_2015.rds")Brief Overview of Censuses in Parallel
colnames(df_1995) [1] "recordid" "address" "house_number"
[4] "street" "zip_original" "cb_original"
[7] "site" "species" "diameter"
[10] "status" "wires" "sidewalk_condition"
[13] "support_structure" "borough" "x"
[16] "y" "longitude" "latitude"
[19] "cb_new" "zip_new" "censustract_2010"
[22] "censusblock_2010" "nta_2010" "segmentid"
[25] "spc_common" "spc_latin" "location"
colnames(df_2005) [1] "objectid" "cen_year" "tree_dbh" "tree_loc" "pit_type"
[6] "soil_lvl" "status" "spc_latin" "spc_common" "vert_other"
[11] "vert_pgrd" "vert_tgrd" "vert_wall" "horz_blck" "horz_grate"
[16] "horz_plant" "horz_other" "sidw_crack" "sidw_raise" "wire_htap"
[21] "wire_prime" "wire_2nd" "wire_other" "inf_canopy" "inf_guard"
[26] "inf_wires" "inf_paving" "inf_outlet" "inf_shoes" "inf_lights"
[31] "inf_other" "trunk_dmg" "zipcode" "zip_city" "cb_num"
[36] "borocode" "boroname" "cncldist" "st_assem" "st_senate"
[41] "nta" "nta_name" "boro_ct" "x_sp" "y_sp"
[46] "objectid_1" "location_1"
colnames(df_2015) [1] "tree_id" "block_id" "created_at" "tree_dbh" "stump_diam"
[6] "curb_loc" "status" "health" "spc_latin" "spc_common"
[11] "steward" "guards" "sidewalk" "user_type" "problems"
[16] "root_stone" "root_grate" "root_other" "trunk_wire" "trnk_light"
[21] "trnk_other" "brch_light" "brch_shoe" "brch_other" "address"
[26] "zipcode" "zip_city" "cb_num" "borocode" "boroname"
[31] "cncldist" "st_assem" "st_senate" "nta" "nta_name"
[36] "boro_ct" "state" "latitude" "longitude" "x_sp"
[41] "y_sp"
As can be seen from the summary of columns in each census, there are more or less variables measured and recorded.
As of 1995, there are 27 columns and 516989 observations.
As of 2005, there are 47 columns and 1777116 observations.
As of 2015, there are 41 columns and 683788 observations.
The number of tree observations in 2005 census is abruptly high. If we take a closer look at the actual data frame, it can be noted that there are 2 pieces of information for each tree which are not formatted correctly then unnecessarily take 2 extra rows per actual tree observation.
head(df_2005[, c(1,2,3)], 6) objectid cen_year tree_dbh
1 1164781 2005 9
2 New York NA NA
3 (40.557117429000002 -74.158024325000000) NA NA
4 1017551 2006 7
5 New York NA NA
6 (40.771243948399999 -73.911987843600002) NA NA
Let’s tidy this into 1 single row per observation with the current 2nd, 5th rows that merely state “New York” to be removed and the coordinates to be added as a new column.
df_2005 <- df_2005 %>%
mutate(row_idx = row_number()) %>%
select(row_idx, everything())
df_2005 <- df_2005 %>%
mutate(coordinates = ifelse(row_idx %% 3 == 0, objectid, NA)) %>%
fill(coordinates, .direction = "up")
df_2005 <- df_2005 %>%
filter(!(row_idx %% 3 == 0 | (row_idx + 1) %% 3 == 0)) %>%
select(-row_idx)
df_2005 <- df_2005 %>%
separate(coordinates, into = c("latitude", "longitude"), sep = " -", extra = "merge") %>%
mutate(
latitude = gsub("[()]", "", latitude),
longitude = gsub("[()]", "", longitude)
)Warning: Expected 2 pieces. Missing pieces filled with `NA` in 8840 rows [181, 398, 452,
508, 517, 668, 738, 761, 766, 800, 802, 808, 1011, 1320, 1342, 1398, 1639,
1748, 1765, 1792, ...].
head(df_2005[, c("objectid", "latitude", "longitude"), 6]) objectid latitude longitude
1 1164781 40.557117429000002 74.158024325000000
2 1017551 40.771243948399999 73.911987843600002
3 730461 40.685647932400002 73.981510094699999
4 718618 40.619671549499998 74.018853453600002
5 926349 40.744177770000000 73.857010341099993
6 741134 40.676315822399999 73.982255347000006
df_2005 now has its previous tree location coordinates in every 3rd row reorganized into 2 separate columns as latitude and longitude.
And df_2005 now has 592372 observations, which is reflective of the actual number of trees recorded.
Census of Trees in 2015
Data Dictionary
data_dict <- data.frame(Variable = c("tree_idx", "tree_dbh", "stump_diam", "curb_loc", "status", "health",
"spc_latin", "spc_common", "steward", "guards", "sidewalk", "problems",
"address", "zipcode", "borocode", "boroname", "nta", "nta_name",
"latitude", "longitude"),
Explanation = c("unique index to distinguish each tree",
"diameter of a tree at breast height",
"diameter of a tree that has been turned into a stump",
"condition of curb applied to a tree {OnCurb | OffsetFromCurb}",
"life status of a tree {Alive | Stump | Dead}",
"health status of a tree {Good | Fair | Poor}",
"latin name of a tree's species",
"common name of a tree's species",
"whether a tree has stewards {None | 1or2 | 3or4}",
"whether a tree has guards {Helpful | Harmful | None | Unsure}",
"whether a tree affects nearby sidewalk {Damage | No Damage}",
"the type of problem a tree has or none",
"address of where a tree is located",
"zipcode of a tree's detailed location",
"code of the borough where a tree is located",
"name of the borough where a tree is located",
"Neighborhood Tabulation Area designated by NYC Department of City Planning",
"name of the Neighborhood Tabulation Area where a tree is at",
"lat, indicated by its name",
"long, indicated by its name"))
kable(data_dict, caption = "Data Dictionary")| Variable | Explanation |
|---|---|
| tree_idx | unique index to distinguish each tree |
| tree_dbh | diameter of a tree at breast height |
| stump_diam | diameter of a tree that has been turned into a stump |
| curb_loc | condition of curb applied to a tree {OnCurb | OffsetFromCurb} |
| status | life status of a tree {Alive | Stump | Dead} |
| health | health status of a tree {Good | Fair | Poor} |
| spc_latin | latin name of a tree’s species |
| spc_common | common name of a tree’s species |
| steward | whether a tree has stewards {None | 1or2 | 3or4} |
| guards | whether a tree has guards {Helpful | Harmful | None | Unsure} |
| sidewalk | whether a tree affects nearby sidewalk {Damage | No Damage} |
| problems | the type of problem a tree has or none |
| address | address of where a tree is located |
| zipcode | zipcode of a tree’s detailed location |
| borocode | code of the borough where a tree is located |
| boroname | name of the borough where a tree is located |
| nta | Neighborhood Tabulation Area designated by NYC Department of City Planning |
| nta_name | name of the Neighborhood Tabulation Area where a tree is at |
| latitude | lat, indicated by its name |
| longitude | long, indicated by its name |
species_count <- df_2015 %>%
filter(!is.na(spc_common) & spc_common != "") %>%
group_by(spc_common) %>%
summarize(count = n()) %>%
rename(tree_species = spc_common) %>%
arrange(desc(count))
species_count$tree_species <- reorder(species_count$tree_species, -species_count$count)
spc_top_bar_chart <- ggplot(head(species_count, 30),
aes(x = tree_species, y = count, fill = count)) +
geom_bar(stat = "identity") +
scale_fill_gradient(low = "lightgreen", high = "darkgreen", "Count") +
labs(title = "Number of Trees of Top 30 Species in NYC 2015", x = "Tree Species", y = "Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 80, vjust = 0.5, hjust = 1))
spc_top_plotly <- ggplotly(spc_top_bar_chart)
spc_top_plotlyspecies_count <- species_count %>%
arrange(count)
species_count$tree_species <- reorder(species_count$tree_species, species_count$count)
spc_few_bar_chart <- ggplot(head(species_count, 30),
aes(x = tree_species, y = count, fill = count)) +
geom_bar(stat = "identity") +
scale_fill_gradient(low = "yellow", high = "#F97A27", "Count") +
labs(title = "Number of Trees of Rarest 30 Species in NYC 2015", x = "Tree Species", y = "Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 80, vjust = 0.5, hjust = 1))
spc_few_plotly <- ggplotly(spc_few_bar_chart)
spc_few_plotlyTree diameter (tree_dbh) or stump diameter (stump_diam)
tree_or_diam <- df_2015 %>%
filter(!(tree_dbh == 0 & stump_diam == 0))
summary(tree_or_diam$tree_dbh[tree_or_diam$tree_dbh != 0]) Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 5.00 10.00 11.58 16.00 450.00
summary(tree_or_diam$stump_diam[tree_or_diam$stump_diam != 0]) Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 7.00 14.00 16.75 23.00 140.00
count_non_zero_stump_diam <- sum(tree_or_diam$stump_diam > 0)
count_non_zero_stump_diam [1] 17654
count_zero_tree_dbh <- sum(tree_or_diam$tree_dbh == 0)
count_zero_tree_dbh[1] 17654
Excluding the observations that have both tree_dbh and stump_diam columns entered as 0, it can be compared to see that the number of trees as a stump with stump_diam > 0 and the number of trees not in normal status with tree_dbh = 0 are equal. It means that either a tree is in its normal tree status or has been turned into a stump for some unknown reason.
Thus, as of 2015, there are 665856 trees and 17654 stumps, despite the trees with neither diameter logged.
Next, we may look into another relevant facet of the given tree observations, the health status.
status_tb <- table(df_2015$status)
status_df <- data.frame(Status = names(status_tb), Count = as.vector(status_tb))
status_df <- status_df %>%
arrange(desc(Count))
colors <- c("#83E16E", "#ff8a00", "#DFC336")
life_status_pie <- plot_ly(status_df, labels = ~Status, values = ~Count, type = 'pie', textinfo = 'label+percent',
insidetextorientation = 'radial',
marker = list(colors = colors)) %>%
layout(title = list(text = 'Pie Chart of Tree Life Status in 2015', x = 0, xanchor = 'left',
font = list(size = 14)),
margin = list(l = 40, r = 40, b = 40, t = 70))
life_status_piehealth_empty <- sum(trimws(df_2015$health) == '')
health_empty[1] 31616
There are 31616 observations with no health value logged, which are then excluded.
health_df_filtered <- df_2015 %>%
filter(health != '') %>%
select(health)
health_tb <- table(health_df_filtered$health)
health_tb_df <- data.frame(Health_status = names(health_tb), Count = as.vector(health_tb))
colors2 <- c("#38AAC3", "#56AB5D", "#E8BE12")
health_pie <- plot_ly(health_tb_df, labels = ~Health_status, values = ~Count, type = 'pie',
textinfo = 'label+percent',
insidetextorientation = 'radial',
marker = list(colors = colors2)) %>%
layout(title = 'Pie Chart of Tree Health Status in 2015')
health_piestatus_combined <- df_2015 %>%
filter(status != '') %>%
select(tree_id, status, health)
status_combined <- status_combined %>%
rename(life_status = status, health_status = health)There are 683788 tree observations which have both life_status and health_status values present.
status_comb_summary <- status_combined %>%
group_by(life_status, health_status) %>%
summarise(Count = n(), .groups = 'drop')
status_comb_summary <- status_comb_summary %>%
filter(!(life_status == 'Alive' & health_status == '')) %>%
arrange(desc(Count))
status_comb_kb <- kable(status_comb_summary,
format = "html",
caption = "Frequency of Combinations of Life_status and Health_status")
status_comb_kb| life_status | health_status | Count |
|---|---|---|
| Alive | Good | 528850 |
| Alive | Fair | 96504 |
| Alive | Poor | 26818 |
| Stump | 17654 | |
| Dead | 13961 |
Did the specific problems listed in the census affect the trees’ either status?
no_log_tree_cnt <- sum(df_2015$problems == '')
no_problem_tree_cnt <- sum(df_2015$problems == 'None')
problem_tree_cnt <- sum(df_2015$problems != '' & df_2015$problems != 'None')In general, there are 31664 trees with problem column not logged, 426280 trees with no problem, and 225844 trees with some problem to be further probed.
tree_problem_df <- df_2015[, c(2, 11, 12, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24)]
colnames(tree_problem_df) [1] "block_id" "steward" "guards" "problems" "root_stone"
[6] "root_grate" "root_other" "trunk_wire" "trnk_light" "trnk_other"
[11] "brch_light" "brch_shoe" "brch_other"
Since we have been aware that ‘problems’ column indicates whether a tree encounters some or none of the enumerated problems. It would be efficient for us to exclude those trees from our analysis of trees’ problems.
problem_cnt_df <- data.frame(Problem_type = c("Not Logged", "No Problem", "Problem"),
Tree_count = c(no_log_tree_cnt, no_problem_tree_cnt, problem_tree_cnt))
problem_cnt_df <- problem_cnt_df %>%
arrange(Tree_count)
problems_bar_chart <- ggplot(problem_cnt_df, aes(x = Problem_type, y = Tree_count)) +
geom_bar(stat = "identity", fill = "#7FB3D5", width = 0.5) +
labs(title = "Bar Plot of Tree Problems Count as of 2015",
x = "Problem_Type",
y = "Tree_Count") +
theme_minimal()
problems_bar_plotly <- ggplotly(problems_bar_chart)
problems_bar_plotlyIf we take a deeper look into the subset of trees with problem, we would be able to see the proportions of each specific problem type.
specific_problem_type_cnt <- df_2015 %>%
group_by(problems) %>%
summarize(Count = n())
specific_problem_type_cnt <- specific_problem_type_cnt %>%
arrange(desc(Count))
specific_problem_type_cnt <- specific_problem_type_cnt[-c(1, 3), ]
#colnames(specific_problem_type_cnt)
specific_problem_type_cnt <- specific_problem_type_cnt %>%
rename(problem_type = problems)problem_bar_chart <- ggplot(head(specific_problem_type_cnt, 25),
aes(x = problem_type, y = Count, fill = Count)) +
geom_bar(stat = "identity") +
scale_fill_gradient(low = "#A9CCE3", high = "#1A5276", "Count") +
labs(title = "Number of Trees with the Most Frequent 25 Combo of Problems",
x = "Problem",
y = "Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 80, vjust = 0.5, hjust = 1, size = 8))
problem_plotly <- ggplotly(problem_bar_chart)
problem_plotlyAs can be noted from the above chart, stones, branch lights are the two problems that occur the most among the trees that were investigated in NYC in 2015.
Among the 9 specific problems enumerated in the dataset, what are the proportions of each occurring or not?
create_new_df_from_column <- function(column) {
counts <- table(column)
new_df <- data.frame(bool = names(counts), Count = as.integer(counts))
names(new_df) <- c("bool", "Count")
return(new_df)
}
colors <- c("#B4720B", "#8FBF53")
new_problem_pie <- function(df) {
df$Percentage <- df$Count / sum(df$Count) * 100
new_pie <- ggplot(df, aes(x = "", y = Count, fill = factor(bool))) +
geom_bar(stat = "identity", width = 1) +
coord_polar(theta = "y") +
theme_void() +
labs(fill = "Bool") +
theme(legend.title = element_text(size = 7), legend.text = element_text(size = 7)) +
geom_text(aes(label = sprintf("%0.1f%%", Percentage)), position = position_stack(vjust = 0.5), size = 3) +
scale_fill_manual(values = colors)
return (new_pie)
}
3 Root Problems
root_stone_df <- create_new_df_from_column(tree_problem_df[5])
root_grate_df <- create_new_df_from_column(tree_problem_df[6])
root_other_df <- create_new_df_from_column(tree_problem_df[7])
root_stone_pie <- new_problem_pie(root_stone_df)
root_grate_pie <- new_problem_pie(root_grate_df)
root_other_pie <- new_problem_pie(root_other_df)
grid.arrange(root_stone_pie, root_grate_pie, root_other_pie, ncol = 3,
top = "Combined Pie Charts for 3 Types of Root Problems")The above 3 pie charts are for problems root_stone, root_grate, and root_other from left to right.
For instance, 20.5% of all trees in 2015 census had no root_stone problem while the rest 79.5% did have.
3 Trunk Problems
trunk_wire_df <- create_new_df_from_column(tree_problem_df[8])
trunk_light_df <- create_new_df_from_column(tree_problem_df[9])
trunk_other_df <- create_new_df_from_column(tree_problem_df[10])
trunk_wire_pie <- new_problem_pie(trunk_wire_df)
trunk_light_pie <- new_problem_pie(trunk_light_df)
trunk_other_pie <- new_problem_pie(trunk_other_df)
grid.arrange(trunk_wire_pie, trunk_light_pie, trunk_other_pie, ncol = 3,
top = "Combined Pie Charts for 3 Types of Trunk Problems")The above 3 pie charts are for problems trunk_wire, trunk_light, and trunk_other from left to right.
3 Branch Problems
branch_light_df <- create_new_df_from_column(tree_problem_df[11])
branch_shoe_df <- create_new_df_from_column(tree_problem_df[12])
branch_other_df <- create_new_df_from_column(tree_problem_df[13])
branch_light_pie <- new_problem_pie(branch_light_df)
branch_shoe_pie <- new_problem_pie(branch_shoe_df)
branch_other_pie <- new_problem_pie(branch_other_df)
grid.arrange(branch_light_pie, branch_shoe_pie, branch_other_pie , ncol = 3,
top = "Combined Pie Charts for 3 Types of Branch Problems")The above 3 pie charts are for problems branch_light, branch_shoe, and branch_other from left to right.
Observation: Comparing the 3 sets of pie charts in parallel, it can be concluded that root_stone from set 1 and branch_light from set 3 are the 2 problems with the highest proportions of “Yes” which denotes that a tree observation did have that particular problem. Also, the results demonstrated in these pie charts are in line with the results computed from counting the frequency of different problem types using the overall problem_type column.
Whether having steward and/or guards better protects the trees?
steward_tb <- table(tree_problem_df$steward)
steward_tb_prop <- prop.table(steward_tb)
steward_tb
1or2 3or4 4orMore None
31615 143557 19183 1610 487823
steward_tb_prop
1or2 3or4 4orMore None
0.046235090 0.209943725 0.028054017 0.002354531 0.713412637
guards_tb <- table(tree_problem_df$guards)
guards_tb_prop <- prop.table(guards_tb)
guards_tb
Harmful Helpful None Unsure
31616 20252 51866 572306 7748
guards_tb_prop
Harmful Helpful None Unsure
0.04623655 0.02961737 0.07585099 0.83696409 0.01133100
protection_comb_summary <- tree_problem_df %>%
filter(!tree_problem_df$steward == '' & !tree_problem_df$guards == '') %>%
group_by(steward, guards) %>%
summarise(Count = n(), .groups = 'drop') %>%
arrange(desc(Count))
kable(protection_comb_summary)| steward | guards | Count |
|---|---|---|
| None | None | 474562 |
| 1or2 | None | 94101 |
| 1or2 | Helpful | 32205 |
| 1or2 | Harmful | 12626 |
| 3or4 | Helpful | 11451 |
| None | Helpful | 7234 |
| 1or2 | Unsure | 4625 |
| None | Harmful | 4022 |
| 3or4 | Harmful | 3422 |
| 3or4 | None | 3286 |
| None | Unsure | 2004 |
| 3or4 | Unsure | 1024 |
| 4orMore | Helpful | 976 |
| 4orMore | None | 357 |
| 4orMore | Harmful | 182 |
| 4orMore | Unsure | 95 |
Conclusion
Throughout this report, I tried to analyze several segments of the census concerning trees in NYC in 2015. The dataset provided has a number of variables measured and recorded for the trees as observations. Accordingly, there could be a lot to unpack and map, whereas the report here could only incorporate analysis of limited facets adapted from the census. I started by looking at the general aspects of the trees, such as diameter (dbh) and life status, and then went deeper into the particularities that possibly could have some correlation in between, which doesn’t turn out to be explicit. Overall, it’s quite interesting to explore the tree dynamics in a metropolis.
Dataset Reference
https://www.kaggle.com/datasets/nycparks/tree-census/data